Skip to content

Conversation

@Mohiiit
Copy link
Member

@Mohiiit Mohiiit commented Nov 19, 2025

Graceful Shutdown with Preconfirmed Block Closing

Summary

Implements graceful shutdown for the block production service that properly closes any open preconfirmed block during shutdown, using the executor's existing state without re-execution.

Problem

Previously, when graceful shutdown was triggered, the block production service would exit immediately, leaving any open preconfirmed block unclosed. This required re-execution of transactions on restart, which is inefficient and can cause inconsistencies.

Solution

The implementation adds a graceful shutdown flow that:

  1. Detects cancellation before the batcher completes
  2. Sends CloseBlock command to the executor thread (which is still running)
  3. Waits for EndBlock message from the executor with the block's execution summary
  4. Closes the block using the executor's existing state (no re-execution needed)
  5. Handles edge cases with timeout protection and graceful degradation

Implementation Details

Shutdown Flow

The shutdown process follows 5 concurrent paths in a tokio::select! loop:

  1. Cancellation Detection: Detects shutdown request and attempts to close block
  2. Batcher Completion: Handles batcher task completion and fallback block closing
  3. EndBlock Processing: Receives and processes EndBlock message from executor
  4. Timeout Protection: Prevents infinite waiting with 5-second timeout
  5. Executor Shutdown: Handles executor thread completion

State Management

  • ShutdownState: Encapsulates shutdown state (shutting_down, batcher_completed, end_block_deadline)
  • try_close_block_on_shutdown(): Attempts to close block and returns ControlFlow for loop control

Note

If graceful shutdown fails for any reason, the block remains preconfirmed and is handled on restart via the existing close_preconfirmed_block_if_exists() mechanism.

@Mohiiit Mohiiit self-assigned this Nov 19, 2025
@Mohiiit Mohiiit added the madara label Nov 19, 2025
#[tokio::test]
async fn test_graceful_shutdown_closes_preconfirmed_block(
#[future]
#[with(Duration::from_secs(3000000000), false)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to be too big? should we not reduce it?

task.await.unwrap();

// Give a small delay to ensure database writes complete
tokio::time::sleep(Duration::from_millis(100)).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed tho? aren't all service (including DB) is closed once we shutdown?

// Graceful shutdown flow:
//
// ┌─────────────────────────────────────────────────────────────────┐
// │ Graceful Shutdown Flow │
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can format this

#[rstest::rstest]
#[timeout(Duration::from_secs(30))]
#[tokio::test]
async fn test_graceful_shutdown_closes_preconfirmed_block(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add some more test cases for edge cases?

Comment on lines 1014 to 1018
if let Some(deadline) = shutdown_state.end_block_deadline.as_ref() {
tokio::time::sleep_until(*deadline).await
} else {
std::future::pending().await
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove the if-else statement and directly sleep until deadline

Copy link
Contributor

@apoorvsadana apoorvsadana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, let's try an approach where

  1. we pass cancellation ctx to executor
  2. block production service waits for batcher, executor and for block to be closed

Comment on lines 695 to 697
// Executor channel is closed - executor has already shut down
// This can happen if executor detected channel closure before we sent CloseBlock
// In this case, the block will remain preconfirmed and be handled on restart
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this possible? if yes, we should fix it as a part of the PR. we shouldn't have a known race condition. if not, this comment probably needs to change.

Comment on lines 850 to 892
// 1. Cancellation Signal (ctx.cancel_global())
// │
// ├─> Batcher detects cancellation → exits → closes send_batch channel
// │
// └─> Main loop detects cancellation (via ctx.cancelled())
// │
// ├─> Check: Is there an open preconfirmed block?
// │ │
// │ ├─> YES: Send CloseBlock command to executor
// │ │ │
// │ │ ├─> Executor receives CloseBlock → sets force_close = true
// │ │ │ │
// │ │ │ └─> On next iteration, executor checks force_close
// │ │ │ │
// │ │ │ └─> Calls finalize() → sends EndBlock message
// │ │ │
// │ │ └─> Main loop waits for EndBlock (with timeout)
// │ │ │
// │ │ ├─> EndBlock received → process_reply() closes block
// │ │ │ └─> Return (shutdown complete)
// │ │ │
// │ │ └─> Timeout/Error → block remains preconfirmed (handled on restart)
// │ │
// │ └─> NO: Continue to wait for batcher completion
// │
// 2. Batcher Completion (alternative path if cancellation not detected first)
// │
// └─> Check: Is there an open preconfirmed block?
// │
// └─> Same flow as above (send CloseBlock → wait for EndBlock)
//
// 3. Executor Shutdown
// │
// └─> Executor detects send_batch channel closure → exits gracefully
// └─> Signals via executor.stop channel
//
// Key Safety Features:
// - Timeout protection: Won't wait indefinitely for EndBlock
// - State validation: Only sends CloseBlock when block exists
// - Graceful degradation: If executor already shut down, block handled on restart
// - No re-execution: Uses executor's existing state (no transaction re-execution needed)
//
// Note: The executor thread shuts down, dropping the `executor.stop` channel, therefore closing it as well.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think these comments are more detailed than they need to be. i am worried they will get outdated very easily because someone will update the executor thread and forget to change this comment because its in a completely different place. maybe this detail makes sense in the PR desc

// - Executor crashes before sending EndBlock
// - EndBlock gets lost in transit
// - Any other unexpected failure
_ = async {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't the services layer already have a deadline after which it forcefully kills the service?

// 4. If CloseBlock succeeds → wait for EndBlock
// 5. If CloseBlock fails or no block → shutdown complete
res = &mut batcher_task, if !shutdown_state.batcher_completed => {
res.context("In batcher task")?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the batcher task closes because of an error, then ? will technically return before we gracefully shutdown?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment still applies?

// 4. If CloseBlock succeeds → wait for EndBlock
// 5. If CloseBlock fails or no block → shutdown complete
res = &mut batcher_task, if !shutdown_state.batcher_completed => {
res.context("In batcher task")?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment still applies?

self.process_reply(reply).await.context("Processing reply from executor thread")?;

// If we're shutting down and just processed EndBlock, shutdown is complete
if shutting_down && is_end_block {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's possible batcher hasn't closed yet right? image a scenario where batcher didn't detect signal yet, shutting_down has been made true and executor closed block because block time was reached.

Comment on lines 871 to 882
if shutting_down {
// Executor exited during shutdown
// If there was a block, executor should have sent EndBlock before exiting
// If we're here, either no block existed or executor panicked
// In case of panic, it will propagate naturally
tracing::debug!("Executor shut down during graceful shutdown");
return Ok(());
}
// Normal executor completion (not during shutdown)
// If executor panicked, recv() will resume the panic (handled by StopErrorReceiver)
res.context("In executor thread")?;
return Ok(());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we just check if res is an error, if yes we add context, otherwise we don't? or maybe we add context all the time. unsure what this if condition is solving here.

#[rstest::rstest]
#[timeout(Duration::from_secs(30))]
#[tokio::test]
async fn test_graceful_shutdown_closes_preconfirmed_block_with_multiple_transactions(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's combine this test and the 1 transaction test?
i think having multiple cases like these make more sene when we've smaller unit tests, not in these ones (or the ones we had in cairo native)

Comment on lines 1921 to 1932
// This test verifies that graceful shutdown completes successfully even in edge cases.
// With the new implementation, the executor automatically closes blocks when it detects
// the send_batch channel closure (WaitTxBatchOutcome::Exit). This test ensures shutdown
// completes gracefully regardless of timing.
//
// The scenario: Cancellation → batcher completes → closes send_batch → executor detects Exit
// → executor closes block automatically → sends EndBlock → main loop closes block → shutdown complete
#[rstest::rstest]
#[timeout(Duration::from_secs(30))]
#[tokio::test]
async fn test_graceful_shutdown_closeblock_fails_executor_channel_closed(
#[future]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the edge case here? not sure how is this different from the previous test case?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants